{
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<i>Copyright (c) Recommenders contributors.</i>\n",
                "\n",
                "<i>Licensed under the MIT License.</i>\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# Pretraining word and entity embeddings\n",
                "This notebook trains the word embeddings and entity embeddings used to initialize DKN."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "outputs": [],
            "source": [
                "from gensim.test.utils import common_texts, get_tmpfile\n",
                "from gensim.models import Word2Vec\n",
                "import os\n",
                "import time\n",
                "from utils.general import *\n",
                "import numpy as np\n",
                "import pickle\n",
                "from utils.task_helper import *"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 2,
            "metadata": {},
            "outputs": [],
            "source": [
                "class MySentenceCollection:\n",
                "    def __init__(self, filename):\n",
                "        self.filename = filename\n",
                "        self.rd = None\n",
                "\n",
                "    def __iter__(self):\n",
                "        self.rd = open(self.filename, 'r', encoding='utf-8', newline='\\r\\n')\n",
                "        return self\n",
                "\n",
                "    def __next__(self):\n",
                "        line = self.rd.readline()\n",
                "        if line:\n",
                "            return list(line.strip('\\r\\n').split(' '))\n",
                "        else:\n",
                "            self.rd.close()\n",
                "            raise StopIteration\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "outputs": [],
            "source": [
                "InFile_dir = 'data_folder/my'\n",
                "OutFile_dir = 'data_folder/my/pretrained-embeddings'\n",
                "OutFile_dir_KG = 'data_folder/my/KG'\n",
                "OutFile_dir_DKN = 'data_folder/my/DKN-training-folder'"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Wrod2vec [4] can learn high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. We use word2vec algorithm implemented in Gensim [5] to generate word embeddings. \n",
                "<img src=\"https://recodatasets.z20.web.core.windows.net/kdd2020/images%2Fword2vec.JPG\" width=\"300\">"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 4,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "start to train word embedding... \tdone . \n",
                        "time elapses: 218.2s\n"
                    ]
                }
            ],
            "source": [
                "def train_word2vec(Path_sentences, OutFile_dir):\n",
                "    OutFile_word2vec = os.path.join(OutFile_dir, r'word2vec.model')\n",
                "    OutFile_word2vec_txt = os.path.join(OutFile_dir, r'word2vec.txt')\n",
                "    create_dir(OutFile_dir)\n",
                "\n",
                "    print('start to train word embedding...', end=' ')\n",
                "    my_sentences = MySentenceCollection(Path_sentences)\n",
                "    # Use more epochs for better accuracy. The size/iter keywords are the gensim 3.x names;\n",
                "    # gensim 4.0 renamed them to vector_size/epochs.\n",
                "    model = Word2Vec(my_sentences, size=32, window=5, min_count=1, workers=8, iter=10)\n",
                "\n",
                "    model.save(OutFile_word2vec)\n",
                "    model.wv.save_word2vec_format(OutFile_word2vec_txt, binary=False)\n",
                "    print('\\tdone . ')\n",
                "\n",
                "Path_sentences = os.path.join(InFile_dir, 'sentence.txt')\n",
                "\n",
                "t0 = time.time()\n",
                "train_word2vec(Path_sentences, OutFile_dir)\n",
                "t1 = time.time()\n",
                "print('time elapses: {0:.1f}s'.format(t1 - t0))"
            ]
        },
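        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "The text file written by `save_word2vec_format(..., binary=False)` begins with a header line holding the vocabulary size and the vector dimension, followed by one line per word: the word, then its vector components separated by spaces. The cell below is a minimal sketch of turning that layout into a NumPy matrix indexed by a word-to-id mapping. The helper `parse_word2vec_text`, the toy vocabulary, and the convention of keeping row 0 as zeros for unknown words are assumptions made for illustration; they are not taken from `format_word_embeddings`."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "import numpy as np\n",
                "\n",
                "def parse_word2vec_text(lines, word2idx, dim):\n",
                "    # Hypothetical helper: build an embedding matrix from word2vec text-format lines.\n",
                "    # Row 0 stays all zeros for padding/unknown words (an assumption of this sketch).\n",
                "    matrix = np.zeros((len(word2idx) + 1, dim), dtype=np.float32)\n",
                "    rows = iter(lines)\n",
                "    next(rows)  # skip the header line: vocabulary size and dimension\n",
                "    for line in rows:\n",
                "        parts = line.rstrip().split(' ')\n",
                "        word, values = parts[0], parts[1:]\n",
                "        # Only keep words that appear in the mapping and have a full-length vector.\n",
                "        if word in word2idx and len(values) == dim:\n",
                "            matrix[word2idx[word]] = [float(v) for v in values]\n",
                "    return matrix\n",
                "\n",
                "# Toy input standing in for the contents of word2vec.txt.\n",
                "toy_lines = ['2 3', 'news 0.1 0.2 0.3', 'sports 0.4 0.5 0.6']\n",
                "toy_word2idx = {'news': 1, 'sports': 2}\n",
                "toy_matrix = parse_word2vec_text(toy_lines, toy_word2idx, dim=3)\n",
                "print(toy_matrix.shape)"
            ]
        },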
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "We leverage a graph embedding model to encode entities into embedding vectors.\n",
                "<img src=\"https://recodatasets.z20.web.core.windows.net/kdd2020/images%2Fkg-embedding-math.JPG\" width=\"600\">\n",
                "<img src=\"https://recodatasets.z20.web.core.windows.net/kdd2020/images%2Fkg-embedding.JPG\" width=\"600\">\n",
                "We use an open-source implementation of TransE (https://github.com/thunlp/Fast-TransX) for generating knowledge graph embeddings:"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 5,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "/mnt/jialia/kdd2020/recommenders/examples/07_tutorials/KDD2020-tutorial\n",
                        "Cloning into 'Fast-TransX'...\n",
                        "remote: Enumerating objects: 439, done.\u001b[K\n",
                        "remote: Total 439 (delta 0), reused 0 (delta 0), pack-reused 439\u001b[K\n",
                        "Receiving objects: 100% (439/439), 10.01 MiB | 13.27 MiB/s, done.\n",
                        "Resolving deltas: 100% (130/130), done.\n",
                        "epoch 0 449712.375000\n",
                        "epoch 1 393208.781250\n",
                        "epoch 2 337558.531250\n",
                        "epoch 3 331067.187500\n",
                        "epoch 4 306186.406250\n",
                        "epoch 5 284518.781250\n",
                        "epoch 6 267733.031250\n",
                        "epoch 7 247449.140625\n",
                        "epoch 8 229839.609375\n",
                        "epoch 9 213476.515625\n"
                    ]
                }
            ],
            "source": [
                "!bash ./run_transE.sh"
            ]
        },
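        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "TransE treats a relation as a translation in embedding space: for a true triple (head, relation, tail), the vector `head + relation` should lie close to `tail`. Training uses a margin ranking loss that pushes true triples closer than corrupted ones. The cell below is a small NumPy sketch of the distance and the loss for one positive/negative pair. The toy vectors, the L1 norm, and the margin of 1.0 are illustrative assumptions and are not taken from `run_transE.sh` or Fast-TransX."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "import numpy as np\n",
                "\n",
                "def transe_distance(head, relation, tail):\n",
                "    # L1 distance between the translated head (head + relation) and the tail.\n",
                "    return np.abs(head + relation - tail).sum()\n",
                "\n",
                "def margin_loss(pos_dist, neg_dist, margin=1.0):\n",
                "    # Hinge loss: a true triple should be closer than a corrupted one by at least the margin.\n",
                "    return max(0.0, margin + pos_dist - neg_dist)\n",
                "\n",
                "# Toy 2-dimensional embeddings (made up for this sketch).\n",
                "h = np.array([0.1, 0.2])\n",
                "r = np.array([0.3, 0.1])\n",
                "t_true = np.array([0.4, 0.3])\n",
                "t_corrupt = np.array([0.9, 0.9])\n",
                "\n",
                "pos = transe_distance(h, r, t_true)\n",
                "neg = transe_distance(h, r, t_corrupt)\n",
                "print(pos, neg, margin_loss(pos, neg))"
            ]
        },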
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "DKN takes into account both the entity embeddings and their context embeddings.\n",
                "<img src=\"https://recodatasets.z20.web.core.windows.net/kdd2020/images/context-embedding.JPG\" width=\"600\">"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 6,
            "metadata": {},
            "outputs": [],
            "source": [
                "##### build context embedding\n",
                "EMBEDDING_LENGTH = 32\n",
                "entity_file = os.path.join(OutFile_dir_KG, 'entity2vec.vec') \n",
                "context_file = os.path.join(OutFile_dir_KG, 'context2vec.vec')   \n",
                "kg_file = os.path.join(OutFile_dir_KG, 'train2id.txt')   \n",
                "gen_context_embedding(entity_file, context_file, kg_file, dim=EMBEDDING_LENGTH)"
            ]
        },
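        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "In the DKN paper [1], the context of an entity is the set of its immediate neighbors in the knowledge graph, and the context embedding is the average of those neighbors' entity embeddings. The cell below is a minimal sketch of that averaging on a toy graph. The function name `average_neighbor_embeddings`, the `(head_id, tail_id)` edge format, and the choice to treat edges as undirected are assumptions of this sketch; `gen_context_embedding` may differ in these details."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "import numpy as np\n",
                "\n",
                "def average_neighbor_embeddings(entity_vecs, edges):\n",
                "    # entity_vecs: (num_entities, dim) array; edges: iterable of (head_id, tail_id) pairs.\n",
                "    sums = np.zeros_like(entity_vecs)\n",
                "    counts = np.zeros(len(entity_vecs))\n",
                "    for head, tail in edges:\n",
                "        # Treat each edge as undirected: both endpoints belong to each other's context.\n",
                "        sums[head] += entity_vecs[tail]\n",
                "        counts[head] += 1\n",
                "        sums[tail] += entity_vecs[head]\n",
                "        counts[tail] += 1\n",
                "    # Entities without neighbors keep a zero context vector.\n",
                "    counts[counts == 0] = 1\n",
                "    return sums / counts[:, None]\n",
                "\n",
                "# Toy graph: entity 0 is linked to entities 1 and 2; entity 3 is isolated.\n",
                "toy_entities = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]])\n",
                "toy_edges = [(0, 1), (0, 2)]\n",
                "toy_context = average_neighbor_embeddings(toy_entities, toy_edges)\n",
                "print(toy_context)"
            ]
        },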
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "outputs": [],
            "source": [
                "load_np_from_txt(\n",
                "    os.path.join(OutFile_dir_KG, 'entity2vec.vec'),\n",
                "    os.path.join(OutFile_dir_DKN, 'entity_embedding.npy'),\n",
                ")\n",
                "load_np_from_txt(\n",
                "    os.path.join(OutFile_dir_KG, 'context2vec.vec'),\n",
                "    os.path.join(OutFile_dir_DKN, 'context_embedding.npy'),\n",
                ")\n",
                "format_word_embeddings(\n",
                "    os.path.join(OutFile_dir, 'word2vec.txt'),\n",
                "    os.path.join(InFile_dir, 'word2idx.pkl'),\n",
                "    os.path.join(OutFile_dir_DKN, 'word_embedding.npy')\n",
                ")\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## References\n",
                "\\[1\\] Wang, Hongwei, et al. \"DKN: Deep Knowledge-Aware Network for News Recommendation.\" Proceedings of the 2018 World Wide Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2018.<br>\n",
                "\\[2\\] Knowledge Graph Embeddings including TransE, TransH, TransR and PTransE. https://github.com/thunlp/KB2E <br>\n",
                "\\[3\\] GloVe: Global Vectors for Word Representation. https://nlp.stanford.edu/projects/glove/ <br>\n",
                "\\[4\\] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2 (NIPS’13). Curran Associates Inc., Red Hook, NY, USA, 3111–3119. <br>\n",
                "\\[5\\] Gensim Word2vec embeddings: https://radimrehurek.com/gensim/models/word2vec.html <br>"
            ]
        }
    ],
    "metadata": {
        "kernelspec": {
            "display_name": "Python (kdd tutorial)",
            "language": "python",
            "name": "kdd_tutorial_2020"
        },
        "language_info": {
            "codemirror_mode": {
                "name": "ipython",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.6.10"
        }
    },
    "nbformat": 4,
    "nbformat_minor": 2
}